Overview

I decided to start a journal to keep track of the work I am doing in regards to applying LIME to the Hamby bullet data. I have been trying to decide which input values to use for LIME in the forensic examiners’ paper, and I have decided that I would like to simplify matters even more by only focusing on a few input options. I have considered a handful of ways of answering this question (accuracy within one implementation of lime, accuracy across 10 implementations of lime, etc.), but I would like to reserve some of these ways for a future more technical paper. However, I would like a place to store my work to use for future purposes, and it would be nice to have a place to work where I do not feel the need to optimize code. I am hoping that this will be a place where I can write drafts of code and create first versions of graphics. Then I can clean up and organize the code when I write the paper.

Prior to this journal, I had been working on all of my ideas in the R markdown document for the paper, but now I am going to transfer a lot of that code to this journal. Many of the sections below include work that I did earlier this semester. I am just transferring them here now. Later I will move the necessary parts to the paper.

# Load libraries
library(tidyverse)
library(bulletr)
library(randomForest)
library(lime)
library(caret)
library(cowplot)
library(purrr)
library(future)
library(furrr)

# Source functions
source("../code/helper_functions.R")

# Load in the training data (Hamby Data 173 and 252) and testing data (Hamby Data 224 Sets 1 and 11)
hamby173and252_train <- read.csv("../data/hamby173and252_train.csv")
hamby224_test <- read.csv("../data/hamby224_test.csv") # created in the paper

# Read in the test_explain data
hamby224_test_explain <- readRDS("../data/hamby224_test_explain.rds")

# Obtain features used when fitting the rtrees random forest
rf_features <- rownames(rtrees$importance)

Plots of Features

Distributions by samesource

I created the following two plots of the training data as suggested by Heike. The histograms below show the distributions of the features used in the random forest rtrees. The histograms are filled by the samesource variable, which is the truth of whether or not the comparison is from the same barrel and land. The default histograms make it hard to compare the distributions of the matches and non-matches since there are many more comparisons that have samesource == FALSE.

# Create plots of the feature distributions colored by samesource for the training data
hamby173and252_train %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(bins = 30) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Frequency", fill = "Same Source?",
       title = "Hamby 173 and 252 Training Data") +
  theme_bw() +
  theme(legend.position = "bottom")

By setting position = "fill" in the geom_histogram function, it is easier to compare the matches and non-matches. These plots could be used in the future to hand select the bins for lime. Additionally, fitting a logistic regression to this data could also be used to determine the LC50, LC10, ad LC90, which could be used as the bins for lime.

# Create plots of the feature distributions colored by samesource for the training data
# using the position = "fill" option
hamby173and252_train %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(position = "fill", bins = 30) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Proportion", fill = "Same Source?",
       title = "Hamby 173 and 252 Training Data") +
  theme_bw() +
  theme(legend.position = "bottom")

The plots below have the same structure, but they are created with the Hamby 224 testing data. I chose to not separate the testing data by sets, but this is something that could be done later if necessary.

# Create plots of the feature distributions colored by samesource for the testing data
hamby224_test %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(bins = 15) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Frequency", fill = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

# Create plots of the feature distributions colored by samesource for the testing 
# data using the position = "fill" option
hamby224_test %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(position = "fill", bins = 15) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Proportion", fill = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

Correlations

I made these plots to look at the correlation between features in the training data within the TRUE and FALSE cases of samesource. The features are highly correlated for the match comparisons. It is clear that the variables are more correlated with the match comparisons than the non-match comparisons. However, there are still some variables that are relatively highly correlation with the non-match comparisons.

# Create a correlation heatmap of the match comparisons in the training data
cor_match <- hamby173and252_train %>%
  select(rf_features, samesource) %>%
  filter(samesource == TRUE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "", y = "", fill = "Correlation",
       title = "Match Comparisons")

# Create a correlation heatmap of the non-match comparisons in the training data
cor_nonmatch <- hamby173and252_train %>%
  select(rf_features, samesource) %>%
  filter(samesource == FALSE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(x = "", y = "", fill = "Correlation",
       title = "Non-Match Comparisons")

# Create a title for the panel of plots
cor_title <- ggdraw() +
  draw_label("Correlation of Feature Variables in the Training Data", 
             fontface = "bold", 
             size = 16,
             x = 0.5,
             hjust = 0.5)

# Create a joined legend for the panel of plots
cor_legend <- get_legend(cor_match + theme(legend.position = "right"))

# Create the panel of plots
plot_grid(cor_title, 
          plot_grid(cor_match, cor_nonmatch, cor_legend, ncol = 3, rel_widths = c(2, 2, 0.5)),
          ncol = 1,
          rel_heights = c(0.25, 3))

The plots below show the correlations for the testing data. The patterns in the plots look really similar to ones of the training data.

# Create a correlation heatmap of the match comparisons in the training data
cor_match_test <- hamby224_test %>%
  select(rf_features, samesource) %>%
  filter(samesource == TRUE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "", y = "", fill = "Correlation",
       title = "Match Comparisons")

# Create a correlation heatmap of the non-match comparisons in the training data
cor_nonmatch_test <- hamby224_test %>%
  select(rf_features, samesource) %>%
  filter(samesource == FALSE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(x = "", y = "", fill = "Correlation",
       title = "Non-Match Comparisons")

# Create a title for the panel of plots
cor_title_test <- ggdraw() +
  draw_label("Correlation of Feature Variables in the Testing Data", 
             fontface = "bold", 
             size = 16,
             x = 0.5,
             hjust = 0.5)

# Create a joined legend for the panel of plots
cor_legend_test <- get_legend(cor_match_test + theme(legend.position = "right"))

# Create the panel of plots
plot_grid(cor_title_test, 
          plot_grid(cor_match_test, cor_nonmatch_test, cor_legend_test, 
                    ncol = 3, rel_widths = c(2, 2, 0.5)),
          ncol = 1,
          rel_heights = c(0.25, 3))

Visualizations of Explanations

Heike made a plot with the structure of the one below when we began with the default settings of lime. This one has been created from the lime explanations with 2 quantile bins. It includes all three of the chosen features for each case in the testing dataset. For both sets, the cutoff of 0.275 < ccf occurs the most frequently.

hamby224_test_explain %>%
  filter(nbins == "2", quantile_bins == TRUE) %>%
  ggplot(aes(x = feature_desc)) +
  geom_bar() +
  coord_flip() + 
  facet_grid(set ~ .)

The plot below is the model for the one that will be used in the app for exploring the lime explanations from the bullet matching data. This is the data from set 1 of the testing dataset.

hamby224_test %>%
  filter(set == "Set 1") %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2,
             text = paste('Bullets Compared: ', bullet1, "-", land1, 
                          "vs", bullet2, "-", land2,
                          '\nRandom Forest Score: ', 
                          ifelse(is.na(rfscore), "Missing due to tank rash", rfscore)))) +
  geom_tile(aes(fill = rfscore)) +
  facet_grid(bullet2 ~ bullet1, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "darkgrey", high = "darkorange", midpoint = 0.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  labs(x = "", y = "", fill = "RF Score")

Draft of LIME Procedure

I wrote this up to start thinking about how to describe the procedure that the lime R package uses to implement the LIME algorithm. It needs a lot of work, but it is a start. The final version of this will end up in the technical stats paper critiquing LIME. The version in the firearm examiner’s paper will be much simpler…

The steps below explain the procedure that the R package is using to apply the LIME algorithm to the bullet matching predictions on the Hamby 224 clone dataset made by the random forest model from Hare. For simplicity, the steps are described as what happens to one case in the test data. Thus, the steps (2) through (7) are repeated for each observation in the testing dataset.

Let \[Y_{jk} = \begin{cases} 1 & \mbox{ if bullets } j \mbox{ and } k \mbox{ were fired from the same gun barrel }\\ 0 & \mbox{otherwise} \end{cases}\] be the response variable in the training dataset, and \(X_1,...,X_9\) correspond to the nine features in the training dataset. Let \(X'_1,...,X'_9\) be the

  1. Distributions for each of the features in the training data are obtained.
    • The method that lime uses to obtain the distribution differs based on the feature type. All of the features in the Hamby datasets are numeric. For numeric features, the default option in lime (quantile_bins = TRUE) computes the quantiles of each feature based on the number of bins selected. The default number of bins is 4 (n_bins = 4).
  2. Many (\(n\)) samples from each of the feature distributions are drawn.
    • To do this, lime has several options (mostly quoted from lime package for now):
    • bin_continuous = TRUE should continuous variables be binned?
    • quantile_bins = TRUE should the ins for n_bins be based on quantiles or spread evenly
    • n_bins = 4 number of bins if bin_continuous is TRUE
    • use_density = TRUE if bin_continuous is FALSE, should continuous data be sampled using kernel density estimation (if not, then will assume normal for continuous variable)
  3. Predictions for the testing data using the random forest model are computed.
    • The random forest model rtrees is used to make a prediction for the observation from the test dataset and each of the \(n=5000\) samples as to whether or not the comparison of the two bullets in the test case are a match. Since the random forest is a classification model, lime is set to return the prediction probabilities.
  4. Similarity score between the observation in the testing data and each of the \(n=5000\) sampled values are obtained.
    • The way that the similarity score is computed depends on the type of feature. Since all of the features in the Hamby 224 test dataset are continuous, the simulated values are first converted into 0-1 features where a 1 indicates that the feature from the simulated value falls in the same bin as the observed data point and a 0 indicates that the feature is not in the same bin as the observed data point. Then, by default, the Gower distance is used to compute the similarity score. (using the gower package in R)
  5. Feature selection is performed by fitting some type of regression model weighted by the similarity scores is to the simulated data and the observed value. The 0-1 versions of the features are used.
    • The user can specify the number of features, \(m\), they would like to select to explain the prediction. lime supports the following options for feature selection
    • forward selection with ridge regression
    • highest weight with ridge regression
    • LASSO model
    • tree model
    • default: forward selection if \(m\le6\) with a ridge regression model, highest weight with ridge regression otherwise
  6. A ridge regression model is fit as the simple model by regressing the prediction probabilities on the \(m\) selected predictor variables and weighted by the similarity scores. If the response is categorical, the user can select how many categories and which categories they want to explain. \[P(Match = TRUE) = \beta_0 + \beta_1 \cdot I\left[X_1 \in \mbox{obs bin}\right] + \beta_2 \cdot I\left[X_2 \in \mbox{obs bin}\right] + \beta_3 \cdot I\left[X_3 \in \mbox{obs bin}\right]\] For the prediction of interest, \[P(Match = TRUE) = \beta_0 + \beta_1 + \beta_2 + \beta_3.\]
  7. The feature weights are extracted and used as the explanations.

Note: I realized that if bin_continuous = FALSE, then bins are not used at all. Instead, a kernel density estimator is used to sample from the distribution (or a normal distribution if specified), and then the ridge regression models are fit without “numerified” values.

Caret vs RandomForest

The lime function in lime is set up to work with specific packages. For example, lime works with a random forest model fit using the caret package, but it is not set up to work with a random forest fit using the randomForest package. I found a suggestion to apply the function as_classifier from the lime package to a model fit using randomForest in order for the lime function to accept the model. It seemed to work. However, I wanted to compare the lime results from a model fit using caret and the same model fit using randomForest. To do this, I used the iris data. The code below goes through the process of fiting the two models (rf_model and caret_model). Then the lime and explain functions are applied to both models.

# Code for comparing the output from LIME when the model is 
# fit with caret and randomForest
# Last Updated: 2018/11/13

## Set up -----------------------------------------------------------------------------

# Split up the data set
iris_test <- iris[1:5, 1:4]
iris_train <- iris[-(1:5), 1:4]
iris_lab <- iris[[5]][-(1:5)]

## LIME with caret --------------------------------------------------------------------

# Create Random Forest model on iris data
caret_model <- train(iris_train, iris_lab, method = 'rf')

# Create an explainer object
caret_explainer <- lime(iris_train, caret_model)

# Explain new observation
caret_explanation <- explain(iris_test, caret_explainer, n_labels = 1, n_features = 4)

## LIME with randomForest -------------------------------------------------------------

rf_model <- randomForest(iris_train, iris_lab)

# Create an explainer object
rf_explainer <- lime(iris_train, model = as_classifier(rf_model))

# Explain new observation
rf_explanation <- explain(iris_test, rf_explainer, n_labels = 1, n_features = 4)

To compare the lime explanations from the two models, I extracted the \(R^2\) value from the simple model, the simple model intercept, the simple model prediction, the feature values, and the feature weights from both explanation datasets. I computed the MSE between each of these values from the two models, and I plotted them on a scatterplot. Since lime is based on random permutations, I would not expect the values from the two models to be exact. However, I would like them to be close. The MSEs are all close to zero, and the plots suggest that the values do an okay job of following the 1-1 line. We decided this seems like the two versions of the explanations are reasonably exchangeable.

## Comparisons -----------------------------------------------------------------------

# Grab the numeric caret explanation variables of interest
caret_numeric <- caret_explanation %>%
  select(model_r2, model_intercept, model_prediction, feature_value, feature_weight) %>%
  gather(key = "variable", value = "caret_value")

# Grab the numeric randomForest explanation variables of interest
rf_numeric <- rf_explanation %>%
  select(model_r2, model_intercept, model_prediction, feature_value, feature_weight) %>%
  gather(key = "variable", value = "rf_value")

# Join the two
lime_numeric <- caret_numeric %>%
  mutate(rf_value = rf_numeric$rf_value, 
         variable = factor(variable))

# Look at the MSEs for the variables
lime_numeric %>%
  group_by(variable) %>%
  summarise(MSE = sum((caret_value - rf_value)^2) / dim(lime_numeric)[1])
## # A tibble: 5 x 2
##   variable               MSE
##   <fct>                <dbl>
## 1 feature_value    0        
## 2 feature_weight   0.000145 
## 3 model_intercept  0.0000487
## 4 model_prediction 0.000107 
## 5 model_r2         0.0000534
# Scatterplots of the randomForest versus caret variable values
ggplot(lime_numeric, aes(x = caret_value, y = rf_value)) + 
  geom_point() + 
  facet_wrap( ~ variable, scales = "free") + 
  geom_abline(intercept = 0, slope = 1)

Comparing Furrr Times

The amount of time it took me to run the explain function with all of my input options was getting pretty long. Heike suggest that I use the furrr package, which implements the purrr functions using the speeed of the future package. I ran and timed the code below to see how much faster the code was. Using the multiprocess option, the code took about half the amount of time to run (from 218.188 seconds to 108.731 seconds)!

# It took about half the time when using the function from furrr!

library(tictoc)
library(future)
library(furrr)

# Slow way
plan(sequential)
tictoc::tic()
sensitivity_explain <- future_pmap(.l = as.list(sensitivity_inputs %>% 
                                                         select(-case)),
                                          .f = run_lime, # run_lime is one of my helper functions
                                          train = hamby173and252_train %>% select(rf_features),
                                          test = hamby224_test %>% arrange(case) %>% select(rf_features) %>% na.omit(),
                                          rfmodel = as_classifier(rtrees),
                                          label = "TRUE",
                                          nfeatures = 3,
                                          seed = FALSE)
tictoc::toc()
# 218.188 sec elapsed

# Fast way
plan(multiprocess)
tictoc::tic()
sensitivity_explain <- future_pmap(.l = as.list(sensitivity_inputs %>% 
                                                         select(-case)),
                                          .f = run_lime, # run_lime is one of my helper functions
                                          train = hamby173and252_train %>% select(rf_features),
                                          test = hamby224_test %>% arrange(case) %>% select(rf_features) %>% na.omit(),
                                          rfmodel = as_classifier(rtrees),
                                          label = "TRUE",
                                          nfeatures = 3,
                                          seed = FALSE)
tictoc::toc()
# 107.731 sec elapsed

Tree Based Bins

Functions

Heike wrote the function treebink to take x and y variables for fitting a tree and returning the cuts for \(k\) bins based on the splits from the tree. Here is an example using the function below with the predictor variable of ccf.

treebink(y = hamby173and252_train$samesource, x = hamby173and252_train$ccf, k = 5)
## [1] 0.4935619 0.5885747 0.6882110 0.7783968

I wrote the function mylime, which adds an option in the lime function from the LIME package to use Heike’s function treebink to obtain tree based bins. The output from mylime can then be input into the explain function to obtain explanations. I was able to rerun all of the code for creating the hamby224_test_explain dataset with the explanations from the tree based bins included for 2 to 6 bins. I’ve been working on redoing code from earlier in this journal to work for the added binning method.

There seems to be a problem with treebink with certain variables with high enough bins. The function returns more than the desired number of bins. Heike is going to look into this.

Assessing Lime Explanations

When we began looking at the explanations from lime with the default settings, we did not think that they made sense. This led me to try applying LIME with a handful of input values. However, since LIME is based on random permutations, I was curious to know how consistent the results were. This led me to try running each implementation for specific starting values a handful of times. The next two sections consider the results from these studies.

Accuracy

The figures in this section are created from the implementations of LIME on the set 1 from the training data for the input options of 2 to 6 quantile bins, 2 to 6 equally spaced bins, kernel density estimation, and normal distribution approximation. Each set of input values was only run once.

# Read in the lime comparison data (the comparisons are run in the firearm examiner
# paper since they will be discussed in the paper)
hamby224_lime_comparisons <- readRDS("../data/hamby224_lime_comparisons.rds")

Complex versus Simple Model Predictions

The plot below compares the predictions from the “simple model” (the ridge regression model) and the “complex model” (the random forest rtrees) on the x-axis from the lime implementations with bin estimation (bin_continuous == TRUE). The simple model predictions are on the y-axis, and the complex model predictions are on the x-axis. The plot is faceted by number of bins and whether or not the bins are equally spaced or based on quantiles. The points are colored by the \(R^2\) value from the fit of the simple model. The lines are linear regression lines fit to the data points within a facet. We would hope that there is a linear relationship between these two variables. None of the cases show strong linear trends, but some are more linear than others. The quantile bins show that the simple model never makes a prediction over 0.6, whereas the random forest model can have predictions of up to 1. The equally spaced bins do have probabilities that exceed 0.6, but only with 3 and 6 bins. I noticed that within the facets, the points are in mostly horizontal strips, and the number of strips is about the number of bins from the lime implementation.

hamby224_lime_comparisons %>%
  filter(!is.na(rfscore), set == "Set 1", bin_continuous == TRUE) %>%
  ggplot(aes(x = rfscore, y = model_prediction)) + 
  geom_point(aes(color = model_r2)) + 
  facet_grid(bin_method ~ nbins) + 
  theme_bw() + 
  stat_smooth(method = "lm", se = FALSE, size = 0.5) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The plot below shows the absolute value of the difference between the complex model prediction and the simple model prediction versus the complex model prediction. The points are faceted by number of bins and whether the bins are equally spaced or based on quantiles. Again, the points are colored by the \(R^2\) values from the simple model. All cases show a v-shaped trend. The low part of the v occurs around a random forest score of about 0.25. That is, the simple model is most accurately portraying the complex model predictions around the random forest score of 0.25. It gets worse near the extremes. However, it is interesting to note that with the equally spaced bins, the absolute difference decreases near 1 for 3 to 6 bins. This is not the case with the quantile bins. It appears that the equally spaced bins are able to make slightly better predictions than the quantile bins when there is a high probability of a match.

hamby224_lime_comparisons %>%
  filter(!is.na(rfscore), set == "Set 1", bin_continuous == TRUE) %>%
  ggplot(aes(x = rfscore, y = abs(diff))) + 
  geom_point(aes(color = model_r2)) + 
  facet_grid(bin_method ~ nbins) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

At some point, I got the idea in my head that it would be interesting to look at the “residuals” (difference in complex and simple model predictions) by feature. The plot below shows one example with the data from 3 equally spaced bins. There are clear trends in the plots, but I am not quite sure what to make of this or how to use it. This may be something to return to.

hamby224_lime_comparisons %>%
  filter(quantile_bins == FALSE, bin_method == "equally_spaced", nbins == "3", set == "Set 1") %>%
  select(rf_features, diff, -case) %>%
  gather(key = feature, value = feature_value, rf_features) %>%
  select(feature, feature_value, diff) %>%
  ggplot(aes(x = feature_value, y = diff)) + 
  geom_point() + 
  facet_wrap(~ feature, scales = "free") + 
  geom_hline(yintercept = 0, color = "blue")

Comparing Input Values by MSE and \(R^2\)

In order to assess which lime implementation is doing the best job of capturing the predictions from the random forest model, I decided to calculate the mean squared error for each of the input situations. I defined the mean squared error as \[\frac{\sum_{i=1}^n (\hat{p}_{\mbox{simple},i}-\hat{p}_{\mbox{complex},i})^2}{n}\] where \(n\) is the number of observations in the testing dataset (within a set). Additionally, I was curious to compare the fits of the ridge regression models across input values, so I calculated the average \(R^2\) for each input situation.

hamby224_lime_results <- hamby224_lime_comparisons %>%
  filter(!is.na(rfscore)) %>%
  group_by(situation, bin_situation, bin_continuous, quantile_bins, nbins, use_density, 
           bin_method, set) %>%
  summarise(mse = (sum(diff^2)) / length(diff),
            ave_r2 = mean(model_r2)) %>%
  arrange(set) %>%
  ungroup()

The plot below shows the mean squared errors for each of the input situations faceted by set. In both cases, the lowest mean squared error occurs with 3 equally spaced bins. This seems to suggest that using 3 equally spaced bins should provide better explanations.

hamby224_lime_results %>%
  ggplot(aes(x = situation, y = mse, color = bin_situation)) +
  geom_point() + 
  facet_grid( ~ set) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "bottom") + 
  labs(x = "Situation", y = "MSE", color = "")

The plot below shows the average \(R^2\) value for each of the input situations faceted by set. The results are very similar for each set, and the highest \(R^2\) values occur for 2 quantile bins. The \(R^2\) values for 3 equally spaced bins are in the middle.

hamby224_lime_results %>%
  ggplot(aes(x = situation, y = ave_r2, color = bin_situation)) +
  geom_point() + 
  facet_grid( ~ set) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") + 
  labs(x = "Situation", y = "Average R2", color = "")

Variability

In order to assess how variable the LIME results are, I ran each of the 12 input situations 10 times. I wanted to look at the variability of the mse for each situation, and I wanted to see how consistent the choice of selected features is by LIME.

# Perform the sensitivity analysis if not already saved
if(!file.exists("../data/hamby224_sensitivity_inputs.rds")) {
  
  # Specify the number of reps and input cases
  nreps = 10
  noptions = 12
  
  # Specify the inputs for the sensitivity analysis
  hamby224_sensitivity_inputs <- data.frame(bin_continuous = c(rep(TRUE, nreps * 10), 
                                                      rep(FALSE, nreps * 2)),
                                   quantile_bins = c(rep(c(TRUE, FALSE), each = nreps * 5), 
                                                     rep(TRUE, nreps * 2)),
                                   nbins = c(rep(rep(2:6, each = nreps), 2), 
                                             rep(4, nreps * 2)),
                                   use_density = c(rep(TRUE, nreps * 11), 
                                                   rep(FALSE, nreps))) %>%
    mutate(case = 1:(nreps * noptions)) %>%
    select(case, bin_continuous, nbins, quantile_bins, use_density)
  
  # Tell R to run the upcoming code in parallel
  plan(multiprocess)

  # Run lime for the sensitivity analysis
  hamby224_sensitivity_outputs <- future_pmap(.l = as.list(hamby224_sensitivity_inputs %>%
                                                             select(-case)),
             .f = run_lime, # run_lime is one of my helper functions
             train = hamby173and252_train %>% select(rf_features),
             test = hamby224_test %>% arrange(case) %>% select(rf_features) %>% na.omit(),
             rfmodel = as_classifier(rtrees),
             label = "TRUE",
             nfeatures = 3,
             seed = FALSE) %>%
    map_df(function(list) list$explain) %>%
    mutate(rep = factor(rep(rep(1:nreps, each = dim(hamby224_test %>% na.omit())[1] * 3),
                            noptions))) %>%
    full_join(hamby224_test %>% na.omit() %>% 
                mutate(case = as.character(case)), by = "case") %>%
    mutate(case = factor(case)) %>%
    select(case, model_r2:feature_weight, bin_continuous:rep, set:land2, rfscore, samesource)
  
  # Save the sensitivity inputs and outputs
  saveRDS(hamby224_sensitivity_inputs, "../data/hamby224_sensitivity_inputs.rds")
  saveRDS(hamby224_sensitivity_outputs, "../data/hamby224_sensitivity_outputs.rds")
  
} else {
  
  # Load in the sensitivity inputs and outputs
  hamby224_sensitivity_inputs <- readRDS("../data/hamby224_sensitivity_inputs.rds")
  hamby224_sensitivity_outputs <- readRDS("../data/hamby224_sensitivity_outputs.rds")

}
hamby224_sensitivity_results <- hamby224_sensitivity_outputs %>%
  mutate(situation = ifelse(bin_continuous == TRUE & quantile_bins == TRUE, 
                            sprintf("%.0f quantile bins", nbins),
                            ifelse(bin_continuous == TRUE & quantile_bins == FALSE, 
                                   sprintf("%.0f equally spaced bins", nbins),
                                   ifelse(bin_continuous == FALSE & use_density == TRUE, 
                                          "kernel density", "normal approximation"))),
         diff = rfscore - model_prediction) %>%
  mutate(situation = fct_relevel(situation, 
                                 "2 quantile bins", 
                                 "3 quantile bins",
                                 "4 quantile bins",
                                 "5 quantile bins",
                                 "6 quantile bins")) %>%
  mutate(bin_situation = ifelse(bin_continuous == TRUE & quantile_bins == TRUE,
                                "quantile bins",
                                ifelse(bin_continuous == TRUE & quantile_bins == FALSE,
                                       "equally spaced bins", "other")))
hamby224_sensitivity_mses <- hamby224_sensitivity_results %>%
  group_by(set, rep, situation, bin_situation) %>%
  summarise(mse = sum((diff)^2) / length(diff))

The plot below shows boxplots of the 10 mean squared errors computed for each of the input situations for each set. There does not appear to be much variation for most of the cases. The largest amount of variation occurs with 4 and 5 equally spaced bins. There is moderate variability for 3 equally spaced bins, but all of the 10 runs resulted in much lower mses than any of the other situations. This still seems to support the choice of using 3 equally spaced quantile bins for both sets.

hamby224_sensitivity_mses %>%
  ggplot(aes(x = situation, y = mse)) + 
  geom_boxplot(aes(color = bin_situation)) + 
  geom_point() + 
  facet_grid( ~ set) + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = "", y = "MSE")

hamby224_consistency <- hamby224_sensitivity_results %>%
  select(set, situation, bin_situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(set, situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(set, situation, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(set, situation, bin_situation, case) %>%
  summarise(nlevels = length(levels(factor(feature))),
            count = n()) %>%
  ungroup() %>%
  group_by(set, situation, bin_situation) %>%
  summarise(mean_nlevels = mean(nlevels),
            sd_nlevels = sd(nlevels),
            upper = mean(nlevels) + sd(nlevels),
            lower = mean(nlevels) - sd(nlevels),
            max = max(nlevels),
            min = min(nlevels))

To assess the consistency of the results, I determined the number of different features chosen by LIME as the top feature within a test case across the 10 replicates. I then computed the average number of different top features chosen by LIME within a input situations for each set. The plot below contains these values shown as the dots. The error bars represent one standard deviation above and below the mean. Note that the number of levels cannot go below 1, but the error bars go below 1. I have decided to ignore this for now, but I would need to come up with a better representation if I used this for something.

There are a handful of situations with a mean near one and very little variation. However, the largest mean is around 2, which is not very high. This seems to suggest that LIME is relatively consistent across runs.

hamby224_consistency %>%
  ggplot(aes(x = situation, y = mean_nlevels)) + 
  geom_point(aes(color = bin_situation)) + 
  geom_errorbar(aes(ymin = lower, ymax = upper)) +
  facet_grid(. ~ set) + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart below shows the proportion of different number of top features chosen across all test cases within a set for each input situation. This shows that the most different features chosen was four. However, many of the cases only have one or two different features chosen, and usually, a larger proportion of the cases have only one top feature chosen across the 10 reps.

hamby224_sensitivity_results %>%
  select(set, situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(set, situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(set, situation, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(set, situation, case) %>%
  summarise(nlevels = length(levels(factor(feature)))) %>%
  ggplot(aes(x = situation, fill = factor(nlevels))) + 
  geom_bar(position = "stack") +
  facet_grid( ~ set) + 
  theme(legend.position = "bottom", 
        axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_fill_brewer(palette = "Blues")

The plot below needs some work, but it does contain some interesting information. Right now, I am only concentrating on set 1. We can see that with some lime settings, all cases have the same “best” variable chosen, and other settings have multiple important variables that depend on the case. We can also see that different variables are selected as important depending on the number of bins. It is also interesting to note that when comparing the quantile bins to the equally spaced bins, this plot shows that the equally spaced bins tend to choose the same first variable for all cases. On the other hand, the quantile bins choose different first variables for the cases.

The fact that different features get chosen as the top feature for different estimation techniques makes sense. You would expect that the number of bins would determine which features are better suited for predicting whether or not a comparison is a match. For example, if you look back at the feature distribution plots, you can see that ccf is the obvious choice for determining between matches and non-matches if you have to choose two equally spaced bins. This is the feature that lime most frequently chooses. You can make similar arguments for other cases as well. Even though different inputs select different variables, I would expect that some variables will be better at prediction than others. I think this is what leads some input values to better LIME explanations and lower MSEs than others.

In the meeting with Heike, we talked about how this plot suggests that LIME is not doing a very good job of providing local explanations. Regardless of the case, it is often suggesting only one of two variables are important, and the variable importance depends on the number of bins. In this plot, I would like consistency in the x-direction, but I would like variability in the y-direction to better understand the variables that are important for a specific case.

hamby224_sensitivity_results %>%
  filter(set == "Set 1") %>%
  droplevels() %>%
  select(situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(situation, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(situation, case) %>%
  mutate(nlevels = length(levels(factor(feature)))) %>%
  ungroup() %>%
  arrange(situation, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(levels(case)), each = 10), 12)) %>%
  ggplot(aes(x = order, fill = feature)) + 
  geom_bar(position = "stack", width = 1) + 
  coord_flip() + 
  facet_wrap(situation ~ ., ncol = 5) +
  #facet_grid(nlevels ~ situation) +
  theme(legend.position = "bottom",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_brewer(palette = "Blues") + 
  labs(x = "cases")

I was thinking a bit more about the relationship between the top features chosen and the type of binning used. I realized it would be helpful to view the relationship between the features and the random forest score. The plots below show the testing data with random forest score versus each of the features. The points are colored by the samesource variable.

hamby224_test %>% 
  select(rf_features, rfscore, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, rfscore, samesource) %>%
  ggplot(aes(x = value, y = rfscore, color = samesource)) + 
  geom_point() + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Random Forest Score", color = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

Input Rankings

In order to decide on which LIME settings to use for the firearm examiners’ paper, I decided to create a table that ranks each of the 12 input settings by MSE from the single implementation, the average MSE across the 10 implementations, the average number of top features chosen across the 10 implementations, and the average \(R^2\) value from the single implementation. The first table below is for set 1, and the second table below is for set 11.

While using these to help me decide which input settings to use, I had a handful of thoughts:

  • Since I am still working on making sense of the variation results, and it would be nice to get the first paper out, I am thinking of basing the decision of input values entirely off of only the assessment measures from the one case.
  • Even with needing more time to think about how to understand the meaning of the results from the 10 reps, the mean MSE from these has very similar results to the MSEs from the one rep, and it does not appear that the top chosen features vary too much between the reps. This makes me think that it would be okay to base my decision off of the results from the single rep.
  • Both Heike and I do not think that the \(R^2\) is that helpful of a measure for determining whether the LIME explanations are good, so I am not going to put much weight on it.
  • Additionally, I do not want to consider either the normal approximation or kernel density estimate since I have not figured out a good way to visualize these in the app. Right now, the feature plots from these explanations are not very helpful. I would need to adjust the intercept or something.

Based on the MSEs from the single implementation of LIME, I will use 3 equally spaced bins for both the set 1 and set 11 explanations in the firearm examiner paper since it leads to the lowest MSE for both sets.

joined_results <- hamby224_lime_results %>%
  ungroup() %>%
  mutate(nbins = as.numeric(as.character(nbins))) %>%
  mutate(situation = ifelse(bin_continuous == TRUE & quantile_bins == TRUE, 
                            sprintf("%.0f quantile bins", nbins),
                            ifelse(bin_continuous == TRUE & quantile_bins == FALSE, 
                                   sprintf("%.0f equally spaced bins", nbins),
                                   ifelse(bin_continuous == FALSE & use_density == TRUE, 
                                          "kernel density", "normal approximation")))) %>%
  mutate(situation = fct_relevel(situation, 
                                 "2 quantile bins", 
                                 "3 quantile bins",
                                 "4 quantile bins",
                                 "5 quantile bins",
                                 "6 quantile bins")) %>%
  select(set, situation, mse, ave_r2) %>%
  full_join(hamby224_sensitivity_mses %>%
              group_by(set, situation) %>%
              summarise(mean_sensitivity_mse = mean(mse), 
                        sd_sensitivity_mse = sd(mse)),
            by = c("set", "situation")) %>%
  full_join(hamby224_consistency %>% select(set, situation, mean_nlevels, sd_nlevels), 
            by = c("set", "situation")) %>%
  group_by(set) %>%
  mutate(mse_order = dense_rank(mse),
         ave_r2_order = dense_rank(desc(ave_r2)),
         mean_sensitivity_mse_order = dense_rank(mean_sensitivity_mse),
         mean_nlevels_order = dense_rank(mean_nlevels))
joined_results %>%
  ungroup() %>%
  filter(set == "Set 1") %>%
  select(situation, mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  arrange(mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  DT::datatable(options = list(pageLength = 12, paging = FALSE, searching = FALSE),
                rownames = FALSE,
                colnames = c("Situation", "MSE", "Mean MSE", "Mean Number of Top Features", 
                             "Mean R2"),
                caption = "Results from Set 1")
joined_results %>%
  ungroup() %>%
  filter(set == "Set 11") %>%
  select(situation, mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  arrange(mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  DT::datatable(options = list(pageLength = 12, paging = FALSE, searching = FALSE),
                rownames = FALSE,
                colnames = c("Situation", "MSE", "Mean MSE", "Mean Number of Top Features", 
                             "Mean R2"),
                caption = "Results from Set 11")

Heike’s Suggestions

  • We talked about including the penalty for the number of parameters in the discussion of the paper.
  • We talked about using a tree to choose the bins. This would allow us to automate the process and to obtain a penalty parameter since the trees give us nesting. (MSE + lambda * p where p is the from the number of trees multiplied by the number of …)
  • Look at the AUC after binning

Results from One Implementation

# Selects only the top features for the cases where binning was used
top_features <- hamby224_test_explain %>%
  filter(bin_situation != "other") %>%
  select(set, situation, bin_situation, case, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, desc(feature_weight_abs)) %>%
  group_by(situation, case) %>%
  slice(1)

# Plots the top feature for each case for each bin situation
ggplot(top_features, aes(x = situation, y = case, fill = feature)) + 
  geom_tile() + 
  facet_grid(set ~ bin_situation, scales = "free") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_fill_brewer(palette = "Blues")

# Plots the proportions of number of top features chosen across number of bins for a case
# within a binning method
top_features %>%
  group_by(set, case, bin_situation) %>%
  summarise(nfeatures = length(levels(factor(feature)))) %>%
  ggplot(aes(x = bin_situation, fill = factor(nfeatures))) + 
  geom_bar(position = "stack") + 
  facet_grid(set ~ .) + 
  theme_bw() +
  scale_fill_brewer(palette = "Blues")

Assessing Variability

In order to assess how variable the LIME results are, I ran each of the 12 input situations 10 times. I wanted to look at the variability of the mse for each situation, and I wanted to see how consistent the choice of selected features is by LIME.

# Perform the sensitivity analysis if not already saved
if(!file.exists("../data/hamby224_sensitivity_inputs.rds")) {
  
  # Specify the number of reps and input cases
  nreps = 10
  noptions = 12
  
  # Specify the inputs for the sensitivity analysis
  hamby224_sensitivity_inputs <- data.frame(bin_continuous = c(rep(TRUE, nreps * 10), 
                                                      rep(FALSE, nreps * 2)),
                                   quantile_bins = c(rep(c(TRUE, FALSE), each = nreps * 5), 
                                                     rep(TRUE, nreps * 2)),
                                   nbins = c(rep(rep(2:6, each = nreps), 2), 
                                             rep(4, nreps * 2)),
                                   use_density = c(rep(TRUE, nreps * 11), 
                                                   rep(FALSE, nreps))) %>%
    mutate(case = 1:(nreps * noptions)) %>%
    select(case, bin_continuous, nbins, quantile_bins, use_density)
  
  # Tell R to run the upcoming code in parallel
  plan(multiprocess)

  # Run lime for the sensitivity analysis
  hamby224_sensitivity_outputs <- future_pmap(.l = as.list(hamby224_sensitivity_inputs %>%
                                                             select(-case)),
             .f = run_lime, # run_lime is one of my helper functions
             train = hamby173and252_train %>% select(rf_features),
             test = hamby224_test %>% arrange(case) %>% select(rf_features) %>% na.omit(),
             rfmodel = as_classifier(rtrees),
             label = "TRUE",
             nfeatures = 3,
             seed = FALSE) %>%
    map_df(function(list) list$explain) %>%
    mutate(rep = factor(rep(rep(1:nreps, each = dim(hamby224_test %>% na.omit())[1] * 3),
                            noptions))) %>%
    full_join(hamby224_test %>% na.omit() %>% 
                mutate(case = as.character(case)), by = "case") %>%
    mutate(case = factor(case)) %>%
    select(case, model_r2:feature_weight, bin_continuous:rep, set:land2, rfscore, samesource)
  
  # Save the sensitivity inputs and outputs
  saveRDS(hamby224_sensitivity_inputs, "../data/hamby224_sensitivity_inputs.rds")
  saveRDS(hamby224_sensitivity_outputs, "../data/hamby224_sensitivity_outputs.rds")
  
} else {
  
  # Load in the sensitivity inputs and outputs
  hamby224_sensitivity_inputs <- readRDS("../data/hamby224_sensitivity_inputs.rds")
  hamby224_sensitivity_outputs <- readRDS("../data/hamby224_sensitivity_outputs.rds")

}
hamby224_sensitivity_results <- hamby224_sensitivity_outputs %>%
  mutate(situation = ifelse(bin_continuous == TRUE & quantile_bins == TRUE, 
                            sprintf("%.0f quantile bins", nbins),
                            ifelse(bin_continuous == TRUE & quantile_bins == FALSE, 
                                   sprintf("%.0f equally spaced bins", nbins),
                                   ifelse(bin_continuous == FALSE & use_density == TRUE, 
                                          "kernel density", "normal approximation"))),
         diff = rfscore - model_prediction) %>%
  mutate(situation = fct_relevel(situation, 
                                 "2 quantile bins", 
                                 "3 quantile bins",
                                 "4 quantile bins",
                                 "5 quantile bins",
                                 "6 quantile bins")) %>%
  mutate(bin_situation = ifelse(bin_continuous == TRUE & quantile_bins == TRUE,
                                "quantile bins",
                                ifelse(bin_continuous == TRUE & quantile_bins == FALSE,
                                       "equally spaced bins", "other")))
hamby224_sensitivity_mses <- hamby224_sensitivity_results %>%
  group_by(set, rep, situation, bin_situation) %>%
  summarise(mse = sum((diff)^2) / length(diff))

The plot below shows boxplots of the 10 mean squared errors computed for each of the input situations for each set. There does not appear to be much variation for most of the cases. The largest amount of variation occurs with 4 and 5 equally spaced bins. There is moderate variability for 3 equally spaced bins, but all of the 10 runs resulted in much lower mses than any of the other situations. This still seems to support the choice of using 3 equally spaced quantile bins for both sets.

hamby224_sensitivity_mses %>%
  ggplot(aes(x = situation, y = mse)) + 
  geom_boxplot(aes(color = bin_situation)) + 
  geom_point() + 
  facet_grid( ~ set) + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = "", y = "MSE")

hamby224_consistency <- hamby224_sensitivity_results %>%
  select(set, situation, bin_situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(set, situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(set, situation, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(set, situation, bin_situation, case) %>%
  summarise(nlevels = length(levels(factor(feature))),
            count = n()) %>%
  ungroup() %>%
  group_by(set, situation, bin_situation) %>%
  summarise(mean_nlevels = mean(nlevels),
            sd_nlevels = sd(nlevels),
            upper = mean(nlevels) + sd(nlevels),
            lower = mean(nlevels) - sd(nlevels),
            max = max(nlevels),
            min = min(nlevels))

To assess the consistency of the results, I determined the number of different features chosen by LIME as the top feature within a test case across the 10 replicates. I then computed the average number of different top features chosen by LIME within a input situations for each set. The plot below contains these values shown as the dots. The error bars represent one standard deviation above and below the mean. Note that the number of levels cannot go below 1, but the error bars go below 1. I have decided to ignore this for now, but I would need to come up with a better representation if I used this for something.

There are a handful of situations with a mean near one and very little variation. However, the largest mean is around 2, which is not very high. This seems to suggest that LIME is relatively consistent across runs.

hamby224_consistency %>%
  ggplot(aes(x = situation, y = mean_nlevels)) + 
  geom_point(aes(color = bin_situation)) + 
  geom_errorbar(aes(ymin = lower, ymax = upper)) +
  facet_grid(. ~ set) + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart below shows the proportion of different number of top features chosen across all test cases within a set for each input situation. This shows that the most different features chosen was four. However, many of the cases only have one or two different features chosen, and usually, a larger proportion of the cases have only one top feature chosen across the 10 reps.

hamby224_sensitivity_results %>%
  select(set, situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(set, situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(set, situation, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(set, situation, case) %>%
  summarise(nlevels = length(levels(factor(feature)))) %>%
  ggplot(aes(x = situation, fill = factor(nlevels))) + 
  geom_bar(position = "stack") +
  facet_grid( ~ set) + 
  theme(legend.position = "bottom", 
        axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_fill_brewer(palette = "Blues")

The plot below needs some work, but it does contain some interesting information. Right now, I am only concentrating on set 1. We can see that with some lime settings, all cases have the same “best” variable chosen, and other settings have multiple important variables that depend on the case. We can also see that different variables are selected as important depending on the number of bins. It is also interesting to note that when comparing the quantile bins to the equally spaced bins, this plot shows that the equally spaced bins tend to choose the same first variable for all cases. On the other hand, the quantile bins choose different first variables for the cases.

The fact that different features get chosen as the top feature for different estimation techniques makes sense. You would expect that the number of bins would determine which features are better suited for predicting whether or not a comparison is a match. For example, if you look back at the feature distribution plots, you can see that ccf is the obvious choice for determining between matches and non-matches if you have to choose two equally spaced bins. This is the feature that lime most frequently chooses. You can make similar arguments for other cases as well. Even though different inputs select different variables, I would expect that some variables will be better at prediction than others. I think this is what leads some input values to better LIME explanations and lower MSEs than others.

In the meeting with Heike, we talked about how this plot suggests that LIME is not doing a very good job of providing local explanations. Regardless of the case, it is often suggesting only one of two variables are important, and the variable importance depends on the number of bins. In this plot, I would like consistency in the x-direction, but I would like variability in the y-direction to better understand the variables that are important for a specific case.

hamby224_sensitivity_results %>%
  filter(set == "Set 1") %>%
  droplevels() %>%
  select(situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(situation, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(situation, case) %>%
  mutate(nlevels = length(levels(factor(feature)))) %>%
  ungroup() %>%
  arrange(situation, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(levels(case)), each = 10), 12)) %>%
  ggplot(aes(x = order, fill = feature)) + 
  geom_bar(position = "stack", width = 1) + 
  coord_flip() + 
  facet_wrap(situation ~ ., ncol = 5) +
  #facet_grid(nlevels ~ situation) +
  theme(legend.position = "bottom",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_brewer(palette = "Blues") + 
  labs(x = "cases")